Abstract. We discuss stability for a class of learning algorithms with respect to
noisy labels. The algorithms we consider are for regression, and they involve the
minimization of regularized risk functionals, such as
$L(f) := \frac{1}{N}\sum_{i=1}^{N} (f(x_i) - y_i)^2 + \lambda \|f\|_{\mathcal{H}}^2$.
We shall call the algorithm 'stable' if, when $y_i$ is a noisy version of
$f^*(x_i)$ for some function $f^* \in \mathcal{H}$, the output of the algorithm converges to $f^*$ as the
regularization term and noise simultaneously vanish. We consider two flavors of
this problem, one where a data set of $N$ points remains fixed, and the other where
$N \to \infty$. For the case where $N \to \infty$, we give conditions for convergence to
$f_E$ (the function which is the expectation of $y(x)$ for each $x$), as $\lambda \to 0$. For the
fixed $N$ case, we describe the limiting 'non-noisy', 'non-regularized' function $\bar{f}$,
and give conditions for convergence. In the process, we develop a set of tools
for dealing with functionals such as $L(f)$, which are applicable to many other
problems in learning theory.

keywords statistical learning theory, learning in the limit, regularized least squares regression, 
RKHS 



1 Introduction 

In regression learning problems, we are given data $(x_i, y_i)_{i=1,\dots,N}$ in $\mathcal{X} \times \mathcal{Y}$, where $\mathcal{X}$
is a bounded subset of $\mathbb{R}^n$ and $\mathcal{Y}$ is a bounded subset of $\mathbb{R}$. We assume this data is chosen
iid (independently and identically distributed) according to an unknown probability
distribution $\mu(x, y)$. We say that $x$ is a 'position', and $y$ is a 'label'. These data points
may be, for example, images of people's faces in pixel space with a person's age as
the corresponding label, or auto-regressive time series data ([4], [6]). The output of a
learning algorithm is a decision function $f : \mathcal{X} \to \mathbb{R}$. Even though we only know $N$
data points from distribution $\mu(x, y)$, we hope to construct $f$ which will be able to
generalize to unobserved points in the distribution. This means we would like $f$ to predict
the value of $y$ for any given value of $x \in \mathcal{X}$. Since we want our function $f$ to fit the
data accurately and also have this generalization ability, we refer to Vapnik's Structural
Risk Minimization (SRM) principle ([10], [11]). In SRM, we limit our choice of functions
$f$ so they are chosen from a class $\mathcal{F}$ of finite 'capacity' (i.e. finite VC dimension).



Otherwise, we cannot hope to choose a function $f$ which has generalization ability;
we would overfit the data. One convenient way to implement SRM is to let $\mathcal{F}$ be a ball
within a Reproducing Kernel Hilbert Space (RKHS) $\mathcal{H}$, with norm $\|\cdot\|_{\mathcal{H}}$. In this form,
we have an Ivanov regularization problem; one can show that the solution is always the
minimizer of a corresponding Tikhonov regularization problem. Algorithms for classification
and regression solve this Tikhonov regularization problem, so that the decision
function is given by $f_{\min}$ (see, e.g., [8]), where

$$f_{\min} := \operatorname{argmin}_{f \in \mathcal{H}} L(f), \quad\text{where}\quad L(f) := \frac{1}{N}\sum_{i=1}^{N} V(f(x_i) - y_i) + \lambda \|f\|_{\mathcal{H}}^2.$$

$L(f)$ is called the Regularized Risk functional. Note that we define our RKHS $\mathcal{H}$ so that
$f \in \mathcal{H}$ iff $\|f\|_{\mathcal{H}}$ is finite. Thus, minimizing over $f \in \mathcal{H}$ is equivalent to minimizing over
all functions $f$. The first term in $L(f)$ is called the Empirical Risk, and $V(\cdot)$ is a
predetermined loss function. We will generally use the least squares loss function $V(z) = z^2$,
but a similar analysis can be performed for other loss functions. The second term
is called the Regularization term, and $\lambda$ is called the Regularization parameter; one
always takes $\lambda > 0$. Here, $\lambda$ can be viewed as the trade-off between accuracy and
generalization. If $\lambda$ is very small, we are minimizing the Empirical Risk, increasing
the accuracy of our model to the data, and possibly overfitting. If $\lambda$ is very large, our
algorithm will generalize at the expense of accuracy. In a sense, $\lambda$ controls the capacity
of the function class from which $f$ is chosen: the larger $\lambda$ is, the smaller the radius of
the ball $\mathcal{F}$ in $\mathcal{H}$. In practice, $\lambda$ is often chosen empirically, perhaps to minimize the
leave-one-out error on a training set.

Another interpretation of this functional is through the eyes of algorithmic stability,
as described by Bousquet and Elisseeff ([1]). Here, the regularization term prevents the
algorithm from being sensitive to the replacement of one data point. In either case, the
regularization problem is well-posed only when $\lambda$ is strictly greater than 0.

We assume that the labels $y$ are 'noisy', in the sense that there is a conditional
distribution $\mu(y|x)$ for each $x$. We denote the expectation value of the label $y$ for position $x$
as $E(y|x)$, and we denote the marginal distribution along the $x$-axis as $\mu(x)$. (This is
the distribution of $\mu(x, y)$ after integrating over the $y$ values.)

For the case when $N \to \infty$, we show convergence of $f_{\min}$ to a function $f_E$ as the
regularization term vanishes, provided $f_E \in \mathcal{H}$; i.e. we need to find conditions on the
simultaneous convergence $N \to \infty$ and $\lambda \to 0$ so that $f_{\min} \to f_E$. Here, the function
$f_E$ is defined by:

$$f_E = \operatorname{argmin}_f \text{Actual Risk}(f), \quad\text{where}\quad \text{Actual Risk}(f) = \int E\big[(f(x) - y)^2 \,\big|\, x\big]\, d\mu(x).$$

In other words, $f_E$ is the minimizer of the 'Actual Risk'. Since we are using the least
squares loss function, this minimizer is simply the expectation of $y$ for each $x$; $f_E(x) = E(y|x)$.

We assume that we have chosen a RKHS which is large enough to contain $f_E$.
In other words, $\|f_E\|_{\mathcal{H}} < \infty$. This is not an exceedingly strong assumption; in fact,
many popular kernels (e.g. Gaussian kernels) can produce RKHS of arbitrarily high VC
dimension. Although $f_E$ may not be in $\mathcal{H}$ for every case, $f_E$ will be in $\mathcal{H}$ for most
smooth processes which have bounded noise, as long as we implement a sufficiently
powerful RKHS.

For the fixed $N$ case, we may express label $y$ for position $x$ as the random variable
$y(x) = f(x) + b(x)$, where $f : \mathcal{X} \to \mathcal{Y}$ is a deterministic function assumed to be in
$\mathcal{H}$, and $b(x)$ is random noise with some probability distribution, with $b(x)$ and $b(x')$
independent if $x \neq x'$. We denote the vector of noise values as $\mathbf{b} = (b_1, b_2, \dots, b_N) =
\{b(x_i)\}_{i=1,\dots,N}$. In order to force the noise to vanish, we will assume the noise is
generated by a fixed random process generating noise with norm bounded by $b_{\max}$ almost
surely, and we will only shrink its amplitude. Since the noise is generated by this fixed
process, the theorem will hold whenever the noise is bounded, and thus, if the noise
is bounded almost surely, the theorem will hold almost surely. Using the least squares
loss and making the noise explicit, our algorithm becomes (writing $g$ for the candidate
function, to distinguish it from the noise-free function $f$):

$$f_{\min} := \operatorname{argmin}_{g \in \mathcal{H}} L(g), \quad\text{where}\quad L(g) = \frac{1}{N}\sum_{i=1}^{N} \big(g(x_i) - f(x_i) - c\, b_i\big)^2 + \lambda \|g\|_{\mathcal{H}}^2,$$

where $\|\mathbf{b}\|_{\ell^2} \le \sqrt{N}\, b_{\max}$ and $c$ is a constant.

For $\lambda > 0$, the minimizer of $L$ is unique, because $\lambda\|\cdot\|_{\mathcal{H}}^2$ is strictly convex. Since
the noise is random, $f_{\min}$ is still a random variable. Our goal is to show 'stability' for
this algorithm, i.e. we need to find a set of conditions on the simultaneous convergence
$\lambda, c \to 0$ which allows $f_{\min} \to \bar{f}$, where $\bar{f}$ is the element of $\mathcal{H}$ with minimal norm
that has zero Empirical Risk when noise is not present. (Since we assume that $f \in \mathcal{H}$, $f$
itself minimizes $L$ when $\lambda = 0$ and $c = 0$. Since many functions in $\mathcal{H}$ vanish at all the $x_i$, there
may be infinitely many functions with zero empirical risk; our algorithm will converge
to the one with the smallest RKHS norm.)

Intuitively, this stability analysis demonstrates that there's no inherent error in our 
algorithm when noise or regularization is present, and that a small amount of noise or 
regularization cannot dramatically disrupt the algorithm's output. This type of stability
is different from the 'algorithmic stability' of Devroye and Wagner ([3]). Algorithmic stability
measures the variability of an algorithm's output as the data set changes. Our type of 
stability determines whether the algorithm's output changes dramatically when noise or 
regularization is present. Algorithmic stability is a property of one particular algorithm 
for one particular distribution of data. Our stability is not - the distribution changes 
as noise is removed, and the algorithm changes as the regularization term shrinks. We 
actually use algorithmic stability to help us show stability of our algorithm in this sense. 

Theorem 1 states that the regularized least squares regression algorithm is stable as 
the number of data points increases to infinity. Theorem 2 states that the regularized 
least squares regression algorithm is stable for a fixed N point data set. 



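Main Algorithm (Regularized Least Squares Regression):

For a data set $Z = (x_i, y_i)_{i=1,\dots,N}$, where for all $i \in \{1, \dots, N\}$, $x_i \in \mathbb{R}^n$, $y_i \in \mathbb{R}$,

$$f_{Z,\lambda} := \operatorname{argmin}_f L_{Z,\lambda}(f), \quad\text{where}\quad L_{Z,\lambda}(f) = \frac{1}{N}\sum_{i=1}^{N} (f(x_i) - y_i)^2 + \lambda \|f\|_{\mathcal{H}}^2.$$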
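As an illustration of the Main Algorithm, the following is a minimal numerical sketch, not an implementation taken from this paper. It assumes a Gaussian kernel with a hypothetical bandwidth `sigma`, one-dimensional positions, and the helper names `gaussian_gram`, `fit_rls` and `regularized_risk`, all of which are illustrative. It uses the Representer Theorem reviewed in Section 2: writing $f = \sum_i \alpha_i K_{x_i}$, the functional becomes $\frac{1}{N}\|G\alpha - y\|^2 + \lambda\, \alpha^T G \alpha$, whose gradient vanishes when $(G + \lambda N I)\alpha = y$.

```python
import numpy as np

def gaussian_gram(xa, xb, sigma=0.5):
    """Matrix of K(x, x') = exp(-|x - x'|^2 / (2 sigma^2)) for 1-D inputs (assumed kernel)."""
    diff = xa[:, None] - xb[None, :]
    return np.exp(-diff ** 2 / (2.0 * sigma ** 2))

def fit_rls(x, y, lam, sigma=0.5):
    """Coefficients alpha of f = sum_i alpha_i K(x_i, .) solving (G + lam*N*I) alpha = y."""
    n = len(x)
    G = gaussian_gram(x, x, sigma)
    return np.linalg.solve(G + lam * n * np.eye(n), y)

def regularized_risk(alpha, x, y, lam, sigma=0.5):
    """(1/N) sum_i (f(x_i) - y_i)^2 + lam * ||f||_H^2 for f = sum_i alpha_i K(x_i, .)."""
    G = gaussian_gram(x, x, sigma)
    residual = G @ alpha - y
    return residual @ residual / len(x) + lam * alpha @ G @ alpha

# Synthetic data: a smooth target plus bounded-variance noise (illustrative only).
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 20)
y = np.sin(3.0 * x) + 0.1 * rng.standard_normal(x.size)

alpha = fit_rls(x, y, lam=1e-2)
risk_at_min = regularized_risk(alpha, x, y, lam=1e-2)
```

Because the functional is a convex quadratic in $\alpha$, any perturbation of `alpha` should not decrease `regularized_risk`; this gives a quick sanity check of the closed form.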



Theorem 1. Denote by $f_E(x)$ the function $E(y|x)$. Denote by $Z_N$ the data set $(x_i, y_i)_{i=1,\dots,N}$.
If $f_E \in \mathcal{H}$ and if $\lambda := \lambda_N$ is chosen to depend on $N$ such that $\lambda_N \to 0$ and $N\lambda_N^3 \to \infty$
as $N \to \infty$, then we have convergence of the Main Algorithm:

$$\|f_{Z_N,\lambda_N} - f_E\|_{\mathcal{H}} \xrightarrow{P} 0 \quad\text{as } N \to \infty.$$

Here, '$\xrightarrow{P}$' denotes convergence in probability.



Theorem 2. Assume we are given $N$ fixed positions $x_1, \dots, x_N$.

Suppose that for each $i \in \{1, \dots, N\}$, the labels are given by $y_i = f(x_i) + \frac{1}{t} b_i$, where the
$b_i$'s are independent random variables with $\|\mathbf{b}\|_{\ell^2} \le \sqrt{N}\, b_{\max}$ almost surely. Denote
by $Z_t$ the data set $(x_i, y_i)_{i=1,\dots,N}$.
Define $\bar{f}$ by:

(i) $\bar{f}(x_i) = f(x_i)$ for $i = 1, \dots, N$

(ii) $\|\bar{f}\|_{\mathcal{H}}$ is minimal, among all functions which satisfy (i).

If $\lambda := \lambda_t$ is chosen to depend on $t$ such that $t\sqrt{\lambda_t} \to \infty$ as $t \to \infty$ and $\lambda_t \to 0$,
then we have convergence of the Main Algorithm almost surely:

$$\|f_{Z_t,\lambda_t} - \bar{f}\|_{\mathcal{H}} \to 0 \quad\text{as } t \to \infty \text{ almost surely.}$$
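The statement of Theorem 2 can be explored numerically. The sketch below is only an illustration under stated assumptions: a Gaussian kernel with a hypothetical bandwidth, six fixed positions, a hypothetical noise-free function $f = K(0.3,\cdot) - 0.5\,K(-0.4,\cdot)$, one fixed draw of bounded noise, and the schedule $\lambda_t = 1/t$ (so that $t\sqrt{\lambda_t} = \sqrt{t} \to \infty$). Both $f_{Z_t,\lambda_t}$ and $\bar{f}$ lie in the span of the $K_{x_i}$, so their RKHS distance is $\sqrt{d^T G d}$, where $d$ is the difference of their coefficient vectors.

```python
import numpy as np

def gaussian_gram(xa, xb, sigma=0.3):
    """Matrix of K(x, x') = exp(-|x - x'|^2 / (2 sigma^2)) for 1-D inputs (assumed kernel)."""
    diff = xa[:, None] - xb[None, :]
    return np.exp(-diff ** 2 / (2.0 * sigma ** 2))

x = np.linspace(-1.0, 1.0, 6)              # fixed positions x_1..x_N
N = x.size
G = gaussian_gram(x, x)

# Hypothetical noise-free function f = K(0.3, .) - 0.5 K(-0.4, .), evaluated at the positions.
centers = np.array([0.3, -0.4])
weights = np.array([1.0, -0.5])
f_at_x = gaussian_gram(x, centers) @ weights

# Minimal-norm interpolant f_bar = sum_i alpha_bar_i K_{x_i}.
alpha_bar = np.linalg.solve(G, f_at_x)

# One fixed draw of bounded noise: |b_i| <= 1, hence ||b||_2 <= sqrt(N) * b_max with b_max = 1.
rng = np.random.default_rng(1)
b = rng.uniform(-1.0, 1.0, size=N)

def rkhs_distance(t):
    """||f_{Z_t, lam_t} - f_bar||_H for labels y = f(x) + b / t and lam_t = 1 / t."""
    lam_t = 1.0 / t
    y_t = f_at_x + b / t
    alpha_t = np.linalg.solve(G + lam_t * N * np.eye(N), y_t)
    d = alpha_t - alpha_bar
    return float(np.sqrt(d @ G @ d))

ts = [10.0, 100.0, 1000.0, 10000.0]
dists = [rkhs_distance(t) for t in ts]
```

In this setup the entries of `dists` shrink as `t` grows, which is consistent with the theorem; a single run like this is an illustration, not evidence for the almost-sure statement.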



Section 2 contains a short review of RKHS. Sections 3 and 4 contain the proofs of
Theorems 1 and 2.

2 Reproducing Kernel Hilbert Space (RKHS) 

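$\mathcal{H}$ is a real Reproducing Kernel Hilbert Space (RKHS) if $\mathcal{H}$ has the following properties:

- $\mathcal{H}$ is a Hilbert space. $\mathcal{H}$ is a complete, inner product, real vector space of functions
$f : \mathcal{X} \to \mathbb{R}$. We denote $\mathcal{H}$'s inner product by $\langle \cdot, \cdot \rangle_{\mathcal{H}}$, and $\mathcal{H}$'s norm by $\|\cdot\|_{\mathcal{H}}$.

- Reproducing Property. There exists a bilinear form $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ such that
for all $x \in \mathcal{X}$, we have $K(x, \cdot) \in \mathcal{H}$ and $\langle f, K(x, \cdot) \rangle_{\mathcal{H}} = f(x)$ for any $f \in \mathcal{H}$. This
$K$ is called the 'reproducing kernel' of the RKHS ([8], [9], [2]). We sometimes denote
$K(x, \cdot)$ by $K_x$.

- Spanning Property. $\mathcal{H} = \operatorname{span}\{K(x, \cdot) \,|\, x \in \mathcal{X}\}$.

Since $\mathcal{H}$ is a real Hilbert space, $\langle f, g \rangle_{\mathcal{H}} = \langle g, f \rangle_{\mathcal{H}}$ for all $f, g \in \mathcal{H}$. It follows that
$K(x, x') = \langle K(x, \cdot), K(x', \cdot) \rangle_{\mathcal{H}} = \langle K(x', \cdot), K(x, \cdot) \rangle_{\mathcal{H}} = K(x', x)$, i.e. $K$ is
symmetric in its two arguments. An equivalent definition of an RKHS is a Hilbert space of
functions $f : \mathcal{X} \to \mathbb{R}$ such that all evaluation functionals $\Gamma_x : f \mapsto f(x)$, $x \in \mathcal{X}$,
are continuous. Given $x_1, \dots, x_N \in \mathcal{X}$, the associated $N \times N$ Gram Matrix $G$ has
entries $G_{ij} = K(x_i, x_j)$, where $K$ is the reproducing kernel for the RKHS $\mathcal{H}$. The Gram
Matrix is always a positive semi-definite matrix.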
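The two properties of the Gram Matrix stated above (symmetry and positive semi-definiteness) can be checked numerically. The following is a small sketch, assuming a Gaussian kernel with a hypothetical bandwidth `sigma` and randomly drawn positions; the function names are illustrative. The quadratic form $c^T G c$ equals $\|\sum_i c_i K_{x_i}\|_{\mathcal{H}}^2$, which is why it cannot be negative.

```python
import numpy as np

def gaussian_kernel(x, x_prime, sigma=0.5):
    """K(x, x') = exp(-||x - x'||^2 / (2 sigma^2)); an assumed example of a reproducing kernel."""
    return np.exp(-np.sum((x - x_prime) ** 2) / (2.0 * sigma ** 2))

def gram_matrix(points, kernel):
    """N x N matrix with entries G_ij = K(x_i, x_j)."""
    n = len(points)
    G = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            G[i, j] = kernel(points[i], points[j])
    return G

rng = np.random.default_rng(0)
points = rng.uniform(-1.0, 1.0, size=(8, 2))   # eight positions in a bounded subset of R^2
G = gram_matrix(points, gaussian_kernel)

eigenvalues = np.linalg.eigvalsh(G)            # real eigenvalues of the symmetric matrix G
c = rng.standard_normal(8)
quadratic_form = c @ G @ c                     # equals || sum_i c_i K_{x_i} ||_H^2
```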

The Representer Theorem transforms the minimization of our functional $L_{Z,\lambda}(f)$
into an optimization problem over only $N$ numbers. This advantage is the main reason
why scientists take $\mathcal{F}$ to be a ball in a RKHS $\mathcal{H}$. We present a corollary of this theorem
below.

Corollary of the Representer Theorem (Kimeldorf, Wahba [5]). The function
$f_{\min} = \operatorname{argmin}_f \frac{1}{N}\sum_{i=1}^{N} V(f(x_i) - y_i) + \lambda\|f\|_{\mathcal{H}}^2$ can be represented in the form
$f_{\min} = \sum_{i=1}^{N} \alpha_i K_{x_i}$. This is true for any arbitrary loss function $V$.
(This corollary is a specific case of the full Representer Theorem [5].)

Having described the basic facts about RKHS, we now continue with the proofs of 
Theorems 1 and 2. 



3 Proof of Theorem 1 

For the Main Algorithm above, we are increasing the size of the data set $Z_N$ as $N$
increases. We need to show convergence $f_{Z_N,\lambda_N} \xrightarrow{P} f_E$, where $\lambda_N \to 0$ and $N \to \infty$.
That is, we need to show

$$\lim_{N \to \infty} P\{\|f_{Z_N,\lambda_N} - f_E\|_{\mathcal{H}} > \eta\} = 0 \quad\text{for every } \eta > 0.$$

We can break up the distance $\|f_{Z_N,\lambda_N} - f_E\|_{\mathcal{H}}$ into two contributions. The first
contribution is called 'variance', and it is due to the finite number of randomly chosen noisy
data points. The variance vanishes with arbitrarily high probability as the number of
data points increases, even if the noise does not vanish. The second contribution is the
'bias' due to the restriction we place on our hypothesis space, i.e. the fact that $f$ is
chosen from within a ball of a RKHS. This term vanishes as the ball gets larger, i.e. when $\lambda_N$
gets smaller.

$$\|f_{Z_N,\lambda_N} - f_E\|_{\mathcal{H}} \le \underbrace{\|f_{Z_N,\lambda_N} - f_{\lambda_N}\|_{\mathcal{H}}}_{\text{variance}} + \underbrace{\|f_{\lambda_N} - f_E\|_{\mathcal{H}}}_{\text{bias}}$$

where $f_{Z_N,\lambda_N} = \operatorname{argmin}_f L_{Z_N,\lambda_N}(f)$,
where $L_{Z_N,\lambda_N}(f) = \frac{1}{N}\sum_{i=1}^{N} (f(x_i) - y_i)^2 + \lambda_N\|f\|_{\mathcal{H}}^2$, and

$f_{\lambda_N} = \operatorname{argmin}_f L_{\lambda_N}(f)$, where $L_{\lambda_N}(f) = \int E\big[(f(x) - y)^2 \,\big|\, x\big]\, d\mu(x) + \lambda_N\|f\|_{\mathcal{H}}^2$.

Lemma 1.1 below describes a method for proving that the minimizers of two convex
functions are close in $\mathcal{H}$.

Lemma 1.1. Suppose $L^1, L^2 : \mathcal{H} \to \mathbb{R}$ are two convex functionals for which there exist
$\epsilon, \delta$ so that:

(a) $\forall f \in \mathcal{H}$: $|L^1(f) - L^2(f)| < \epsilon$

(b) $|L^1(f) - L^1(f^1)| < 2\epsilon \implies \|f - f^1\|_{\mathcal{H}} < \delta$

Then if the minimizers $f^1 := \operatorname{argmin}_f L^1(f)$ and $f^2 := \operatorname{argmin}_f L^2(f)$ exist, they satisfy

$$\|f^1 - f^2\|_{\mathcal{H}} < \delta.$$

Proof. Since $f^1$ and $f^2$ are minimizers of $L^1$ and $L^2$ respectively, and using the
closeness condition (a):

$$L^2(f^2) \le L^2(f^1) < L^1(f^1) + \epsilon \quad\text{and}\quad L^1(f^1) \le L^1(f^2) < L^2(f^2) + \epsilon.$$

So, $|L^1(f^1) - L^2(f^2)| < \epsilon$.

Now, $|L^1(f^1) - L^1(f^2)| \le |L^1(f^1) - L^2(f^2)| + |L^2(f^2) - L^1(f^2)| < 2\epsilon$, and finally,
condition (b) will give us $\|f^1 - f^2\|_{\mathcal{H}} < \delta$. □
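Lemma 1.1 can be illustrated in the simplest possible setting, where $\mathcal{H}$ is replaced by the real line. The sketch below is a toy example under that assumption, with hypothetical functionals `L1` and `L2` chosen so that both conditions can be verified by hand: `L2` differs from `L1` by at most $\epsilon/2$ (condition (a)), and $L^1(f) - L^1(f^1) = (1+\lambda)(f - f^1)^2$, so condition (b) holds with $\delta = \sqrt{2\epsilon/(1+\lambda)}$.

```python
import numpy as np

lam, eps = 0.5, 0.05

def L1(f):
    """Toy convex functional on the real line: squared error plus a quadratic penalty."""
    return (f - 1.0) ** 2 + lam * f ** 2

def L2(f):
    """L1 plus a perturbation bounded by eps/2; small enough that L2 stays convex."""
    return L1(f) + 0.5 * eps * np.cos(5.0 * f)

# Locate both minimizers by a fine grid search (adequate for this 1-D illustration).
grid = np.linspace(-2.0, 3.0, 500001)
f1 = grid[np.argmin(L1(grid))]
f2 = grid[np.argmin(L2(grid))]

# Radius delta provided by condition (b) for this particular L1.
delta = np.sqrt(2.0 * eps / (1.0 + lam))
gap = abs(f1 - f2)
```

Here `gap` is much smaller than `delta`, as the lemma only guarantees an upper bound on the distance between the minimizers.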

Back to the proof of Theorem 1. We proceed one term at a time. 

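Variance Term. We will choose more general versions of $L_{Z_N,\lambda_N}(f)$ and $L_{\lambda_N}(f)$
temporarily.

$$\mathcal{L}_{Z_N,\lambda_N}(f) := \frac{1}{N}\sum_{i=1}^{N} V(f(x_i) - y_i) + \lambda_N\|f\|_{\mathcal{H}}^2,$$

$$\mathcal{L}_{\lambda_N}(f) := \int E\big(V(f(x) - y) \,\big|\, x\big)\, d\mu(x) + \lambda_N\|f\|_{\mathcal{H}}^2,$$

where $|V(a) - V(b)| \le \sigma_V |a - b|$.

That is, we assume that the loss function $V$ is Lipschitz continuous, or 'sigma-admissible'.
The least squares loss has $\sigma_V = 2X_{\max}$ for arguments bounded in absolute value by $X_{\max}$, since
$|V(a) - V(b)| = |a^2 - b^2| \le |a + b|\,|a - b| \le 2X_{\max}|a - b|$.

We need to verify the conditions (a) and (b) in order to use Lemma 1.1. To verify
the closeness property (a) for our functionals $\mathcal{L}_{Z_N,\lambda_N}(f)$ and $\mathcal{L}_{\lambda_N}(f)$:

$$\mathcal{L}_{Z_N,\lambda_N}(f) - \mathcal{L}_{\lambda_N}(f) = \frac{1}{N}\sum_{i=1}^{N} V(f(x_i) - y_i) - \int E\big(V(f(x) - y) \,\big|\, x\big)\, d\mu(x) = \text{Empirical Risk}(f) - \text{Actual Risk}(f). \quad (1)$$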



There are many available upper bounds for the right side of equation (1), including
Vapnik's VC bound, which relies on the VC dimension of the class of allowed
decision functions $\mathcal{F}$ ([10], [11]). The particular bound we utilize for this paper was
constructed by Bousquet and Elisseeff [1], and it is based on 'algorithmic stability'.
In general, bounds of this quantity are probabilistic, and are based on some capacity
measure of the algorithm or space of functions $\mathcal{F}$. This particular bound relies on the
sigma-admissibility of the loss function $V$, and McDiarmid's concentration of measure
inequality.



Def. $Z_N$ is a training sample $(x_1, y_1), \dots, (x_N, y_N)$, and $Z^i_{N;x,y} = (Z_N \setminus (x_i, y_i)) \cup (x, y)$.
That is, we replace the $i$-th training point in $Z_N$ by a new data point in order to obtain
$Z^i_{N;x,y}$.

Def. The algorithm Alg : $Z \to f_Z$ is uniformly $\beta$-stable with respect to loss function
$V_\beta : \mathcal{X} \times \mathcal{Y} \times \mathbb{R} \to \mathbb{R}$ if:

$$\big|V_\beta(\tilde{x}, \tilde{y}, f_{Z_N}(\tilde{x})) - V_\beta(\tilde{x}, \tilde{y}, f_{Z^i_{N;x,y}}(\tilde{x}))\big| \le \beta \quad\text{for all } (x, y), (\tilde{x}, \tilde{y}) \in \mathcal{X} \times \mathcal{Y}, \text{ all } i, \text{ and all } Z_N.$$

Basically, this algorithmic stability measures how much the algorithm's output could
possibly change, as measured by the loss function $V_\beta$, when we replace one data point.

Algorithmic Stability Theorem (Bousquet and Elisseeff, [1]). If we are given a
uniformly $\beta$-stable algorithm with respect to loss function $V_\beta$, which outputs functions
bounded by the constant $M$ (i.e. $|f_Z(x)| \le M$ $\forall x \in \mathcal{X}$, $\forall Z$), then for any $N \ge 8M^2/\epsilon^2$,

$$P\{|\text{Empirical Risk}(f) - \text{Actual Risk}(f)| > \epsilon\} \le p_N, \quad\text{with}\quad p_N = \frac{64MN\beta + 8M^2}{N\epsilon^2}.$$

Algorithmic Stability of Tikhonov Learning Algorithms (Bousquet and Elisseeff,
[1]). The Main Algorithm is uniformly $\beta$-stable, with $\beta = \frac{C^2\kappa^2}{\lambda N}$. Here, $C$ is an upper
bound on the sigma-admissibility constant $\sigma_V$, and $\kappa$ bounds the diagonal
elements of the Gram Matrix, in the sense that $\max_i G_{ii} \le \kappa^2$.

Returning to the proof of Theorem 1, we now know the right side of (1) is bounded
by $\epsilon$ (for large values of $N$) with probability at least $1 - p_N$, where:

$$p_N = \frac{64MN\beta + 8M^2}{N\epsilon^2}, \quad\text{where}\quad \beta = \frac{C^2\kappa^2}{\lambda_N N}, \quad\text{so}\quad p_N = \frac{64MC^2\kappa^2 + 8M^2\lambda_N}{N\lambda_N\epsilon^2}.$$

We now have the closeness condition (a) of Lemma 1.1 satisfied with probability at
least $1 - p_N$, i.e. $|L_{Z_N,\lambda_N}(f) - L_{\lambda_N}(f)| < \epsilon$ with probability at least $1 - p_N$, where
$p_N$ is given by the expression above.

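Now we verify condition (b). We need to show that $|L_{\lambda_N}(f) - L_{\lambda_N}(f_{\lambda_N})| < 2\epsilon$
implies that $\|f - f_{\lambda_N}\|_{\mathcal{H}} < \delta$ for every $f$. Let's define a function $h$ so that $h := f - f_{\lambda_N}$
in what follows.

$$L_{\lambda_N}(f) = \int E\big[(f_{\lambda_N}(x) + h(x) - y)^2 \,\big|\, x\big]\, d\mu(x) + \lambda_N\|f_{\lambda_N} + h\|_{\mathcal{H}}^2$$

$$= L_{\lambda_N}(f_{\lambda_N}) + \Big[2\int E\big[(f_{\lambda_N}(x) - y)\,h(x) \,\big|\, x\big]\, d\mu(x) + 2\lambda_N \langle f_{\lambda_N}, h \rangle_{\mathcal{H}}\Big] + \int h^2(x)\, d\mu(x) + \lambda_N\|h\|_{\mathcal{H}}^2.$$

The terms in the brackets are linear in $h$. Remember that $f_{\lambda_N}$ is the minimizer of $L_{\lambda_N}$,
and thus the linear terms must be zero. (If the linear terms are non-zero, we can reverse
the sign of $h$ and contradict $f_{\lambda_N}$ as the minimizer.) The last two terms are always
non-negative.

$$L_{\lambda_N}(f) = L_{\lambda_N}(f_{\lambda_N}) + \int h^2(x)\, d\mu(x) + \lambda_N\|h\|_{\mathcal{H}}^2 \ge L_{\lambda_N}(f_{\lambda_N}) + \lambda_N\|h\|_{\mathcal{H}}^2.$$

Now we can see that (b) holds:

$$\|f - f_{\lambda_N}\|_{\mathcal{H}} = \|h\|_{\mathcal{H}} \le \left(\frac{L_{\lambda_N}(f) - L_{\lambda_N}(f_{\lambda_N})}{\lambda_N}\right)^{1/2} < \left(\frac{2\epsilon}{\lambda_N}\right)^{1/2}.$$

In this case, $\delta = (2\epsilon/\lambda_N)^{1/2}$.

Since both the conditions (a) and (b) are satisfied, Lemma 1.1 produces

$$\|f_{Z_N,\lambda_N} - f_{\lambda_N}\|_{\mathcal{H}} \le \frac{\sqrt{2\epsilon}}{\sqrt{\lambda_N}} \quad\text{with prob. at least } 1 - p_N, \quad (2)$$

where $p_N$ is the probability bound given above. We are done with the variance term.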

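Bias Term. We will prove that the bias term vanishes using the spectral theorem. Define
the function $h(x) := f(x) - f_E(x)$ in what follows. Now,

$$L_{\lambda_N}(f) = \int \langle K_x, h \rangle_{\mathcal{H}}^2\, d\mu(x) + \int E\big[(f_E(x) - y)^2 \,\big|\, x\big]\, d\mu(x) + \lambda_N\|h + f_E\|_{\mathcal{H}}^2.$$

The minimizer of $L_{\lambda_N}(f)$ again must have first variational derivative equal to 0.
Using Fubini's Theorem, we find:

$$\frac{dL_{\lambda_N}(f + \gamma g)}{d\gamma}\Big|_{\gamma=0} = \frac{\partial}{\partial\gamma}\Big[\int \langle K_x, h + \gamma g \rangle_{\mathcal{H}}^2\, d\mu(x) + \lambda_N\|h + f_E + \gamma g\|_{\mathcal{H}}^2\Big]_{\gamma=0}$$

$$= 2\Big\langle \int \langle K_x, h \rangle_{\mathcal{H}} K_x\, d\mu(x) + \lambda_N (h + f_E),\; g \Big\rangle_{\mathcal{H}}.$$

If $f = f_{\lambda_N}$, the above expression must be zero for all $g$, thus:

$$0 = \int \langle K_x, f_{\lambda_N} - f_E \rangle_{\mathcal{H}} K_x\, d\mu(x) + \lambda_N f_{\lambda_N}. \quad (3)$$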

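Let's define a new operator $T$,

Def. $T : \mathcal{H} \to \mathcal{H}$, $\quad Tf := \int \langle K_x, f \rangle_{\mathcal{H}} K_x\, d\mu(x)$.

One can check that $T$ is self-adjoint since $\langle Tf, g \rangle_{\mathcal{H}} = \int \langle K_x, f \rangle_{\mathcal{H}} \langle K_x, g \rangle_{\mathcal{H}}\, d\mu(x) =
\langle f, Tg \rangle_{\mathcal{H}}$. For an operator $Q$ from one Hilbert space $\mathcal{H}_1$ to another, $\mathcal{H}_2$, the operator
norm of $Q$ is defined by $\|Q\|_{\mathcal{L}(\mathcal{H}_1,\mathcal{H}_2)} := \sup_{\|s\|_{\mathcal{H}_1} = 1} \|Qs\|_{\mathcal{H}_2}$. Our operator $T$ is bounded,
since by Cauchy-Schwarz,

$$\|T\|^2_{\mathcal{L}(\mathcal{H},\mathcal{H})} \le \sup_{\|f\|_{\mathcal{H}} = 1} \|f\|_{\mathcal{H}}^2 \int\!\!\int \sqrt{K(x,x)}\, K(x,z)\, \sqrt{K(z,z)}\, d\mu(x)\, d\mu(z) < \infty.$$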

We are going to use the spectral theorem next, but first let us review a few facts
from functional analysis about this theorem ([7]). The spectral theorem allows one to
define functions of a bounded self-adjoint operator on a Hilbert space $\mathcal{H}$. If the
function is a polynomial, e.g. $\phi(z) = 3z^2 - 5z + 2$, then it is clear how to define the
corresponding operator $\phi(A)$: $\phi(A) = 3A^2 - 5A + 2$. The spectral theorem extends
the correspondence $\phi(z) \leftrightarrow \phi(A)$ to all continuous functions (in fact, to all bounded
Borel functions). Moreover, one has $\|\phi(A)\|_{\mathcal{L}(\mathcal{H},\mathcal{H})} \le \sup\{|\phi(z)| ;\; z \in \operatorname{spec}(A)\}$.
Because $\phi$ is a real function, the operator $\phi(A)$ provided by the spectral theorem is also
self-adjoint. In addition, for each $f \in \mathcal{H}$, we have a measure $\nu_{f;A}$ on $\operatorname{spec}(A)$ such
that $\langle \phi(A)f, f \rangle_{\mathcal{H}} = \int_{\operatorname{spec}(A)} \phi(z)\, d\nu_{f;A}(z)$. The measure $\nu_{f;A}$ is concentrated on that
part of the spectrum $\operatorname{spec}(A)$ along which $f$ has a nonzero component. In particular, if
$f \in \operatorname{Ker}(A)$, then $f$ is an eigenvector of $A$ with eigenvalue 0, and $\nu_{f;A}$ is a $\delta$-measure
concentrated on $\{0\}$. If on the contrary, $f \perp \operatorname{Ker}(A)$, then $\nu_{f;A}(\{0\}) = 0$.

Now, using the definition of the spectral measure $\nu_{f_E;T}$ for the operator $T$ and the
function $f_E$, we have from (3):

$$0 = T(f_{\lambda_N} - f_E) + \lambda_N f_{\lambda_N}$$

$$\implies f_{\lambda_N} - f_E = (T + \lambda_N)^{-1}(-\lambda_N f_E)$$

$$\implies \|f_{\lambda_N} - f_E\|_{\mathcal{H}}^2 = \|(T + \lambda_N)^{-1}\lambda_N f_E\|_{\mathcal{H}}^2 = \int_{\operatorname{spec}(T)} \left(\frac{\lambda_N}{z + \lambda_N}\right)^2 d\nu_{f_E;T}(z).$$

Since $K(\cdot, \cdot)$ is positive semidefinite, $T$ is a positive operator and thus has non-negative
spectrum only. One can see that $\operatorname{Ker} T$ is trivial, i.e. take any function $\vartheta$ such that $T\vartheta =
0$. Then, $0 = \langle T\vartheta, \vartheta \rangle_{\mathcal{H}} = \int \vartheta^2(x)\, d\mu(x)$; thus, $\vartheta$ must be zero almost everywhere. It
follows that $\{0\}$ is a set of measure zero for $\nu_{f_E;T}$. As $N \to \infty$, $\lambda_N \to 0$, and the
function $\left(\frac{\lambda_N}{z + \lambda_N}\right)^2$ converges to 0 pointwise on $\mathbb{R}_+ \setminus \{0\}$, and thus almost everywhere with
respect to $\nu_{f_E;T}$; since this function is bounded by 1, we can use the dominated
convergence theorem to say that the integral vanishes as $N \to \infty$. One cannot give
a more explicit bound for this term without more information about the relationship
between $\mu$ and $\mathcal{H}$. In any case, we have convergence of the bias term to 0.
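The vanishing of the bias term can be visualized in finite dimensions. The sketch below is an analogy only: it replaces the operator $T$ by a hypothetical symmetric positive definite matrix (so its kernel is trivial) and $f_E$ by a random vector, both of which are assumptions made for illustration. It computes $\|(T + \lambda)^{-1}\lambda f\|$ directly and through the eigen-decomposition, where the squared coefficients of $f$ in the eigenbasis play the role of the spectral measure.

```python
import numpy as np

rng = np.random.default_rng(2)
B = rng.standard_normal((5, 5))
T = B @ B.T + 0.1 * np.eye(5)      # symmetric positive definite stand-in for the operator T
f = rng.standard_normal(5)         # stand-in for f_E

z, U = np.linalg.eigh(T)           # eigenvalues z_k (the spectrum) and orthonormal eigenvectors
weights = (U.T @ f) ** 2           # discrete analogue of the spectral measure of f

def bias_direct(lam):
    """|| (T + lam I)^{-1} lam f || computed by solving a linear system."""
    return float(np.linalg.norm(np.linalg.solve(T + lam * np.eye(5), lam * f)))

def bias_spectral(lam):
    """Same quantity as sqrt( sum_k (lam / (z_k + lam))^2 * weights_k )."""
    return float(np.sqrt(np.sum((lam / (z + lam)) ** 2 * weights)))

lams = [1.0, 1e-1, 1e-2, 1e-3]
direct = [bias_direct(lam) for lam in lams]
spectral = [bias_spectral(lam) for lam in lams]
```

In this finite-dimensional setting the smallest eigenvalue gives an explicit rate; in the infinite-dimensional setting of the proof no such rate is available, which is why the dominated convergence theorem is used instead.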

Now, we can complete the proof of Theorem 1. For any $\eta > 0$, we must be able to
show that $\lim_{N \to \infty} P\{\|f_{Z_N,\lambda_N} - f_E\|_{\mathcal{H}} > \eta\} = 0$. So, let us choose an arbitrary fixed
value $\eta$. The bias term vanishes as $N \to \infty$ and $\lambda_N \to 0$, so there must exist an $N_0$
so that for $N > N_0$, $\lambda_N$ is sufficiently small, so that the bias term is bounded by $\eta/2$.
Thus, we consider the bias term bounded by $\eta/2$ in the limit as $N \to \infty$; since this
term does not depend on the data, the bound clearly holds with probability 1. Now, we
must choose $\epsilon_N$ so that the variance term is bounded by $\eta/2$. Using the bound in (2),
we choose $\epsilon_N = \frac{\eta^2\lambda_N}{8}$. The corresponding probability is then given by

$$p_N = \frac{64MC^2\kappa^2 + 8M^2\lambda_N}{N\lambda_N\epsilon_N^2} = \frac{64\,(64MC^2\kappa^2 + 8M^2\lambda_N)}{N\lambda_N^3\eta^4}.$$

We need $p_N$ to vanish as $N \to \infty$; this is satisfied if $N\lambda_N^3 \to \infty$ as $N \to \infty$. Also,
there must exist an $N_0$ such that for $N > N_0$, we have $N \ge 8M^2/\epsilon_N^2$; we need this in order
to use the Algorithmic Stability Theorem. Thus, Theorem 1 is proved. □
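To see how the schedule for $\lambda_N$ affects the probability bound, the following sketch evaluates $p_N = (64MC^2\kappa^2 + 8M^2\lambda_N)/(N\lambda_N\epsilon_N^2)$ with $\epsilon_N = \eta^2\lambda_N/8$. The constants `M`, `C`, `kappa`, the tolerance `eta` and the two schedules are hypothetical values chosen only for illustration.

```python
def failure_probability(N, lam, eta, M=1.0, C=2.0, kappa=1.0):
    """p_N = (64 M C^2 kappa^2 + 8 M^2 lam) / (N lam eps^2), with eps = eta^2 lam / 8."""
    eps = eta ** 2 * lam / 8.0
    return (64.0 * M * C ** 2 * kappa ** 2 + 8.0 * M ** 2 * lam) / (N * lam * eps ** 2)

sample_sizes = [10 ** 4, 10 ** 8, 10 ** 12, 10 ** 16]
eta = 1.0

# Schedule lam_N = N^(-1/4): N * lam_N^3 = N^(1/4) grows, so the bound shrinks.
slow = [failure_probability(N, N ** -0.25, eta) for N in sample_sizes]

# Schedule lam_N = N^(-1/2): N * lam_N^3 = N^(-1/2) shrinks, so the bound blows up.
fast = [failure_probability(N, N ** -0.5, eta) for N in sample_sizes]
```

With these constants the values in `slow` decrease but are still larger than 1 for the sample sizes listed, so the bound only becomes informative for very large $N$; the point of the sketch is the direction of the trend, not the magnitude.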



4 Proof of Theorem 2 



This section contains the proof of Theorem 2. First, some notation. The positions $x_1, \dots, x_N$
will be considered fixed throughout this section.

Def. $P : \mathcal{H} \to \mathbb{R}^N$, $\quad f \mapsto (f(x_1), f(x_2), \dots, f(x_N))$.

The 'evaluation operator' $P$ evaluates a function $f$ at each position $x_i$ in the data set.
Note that $P$ 'loses information' about a function $f$ by evaluating it at only $N$ points.
That is, $\operatorname{Ker} P$ is a nontrivial subspace of $\mathcal{H}$. The adjoint $P^* : \mathbb{R}^N \to \mathcal{H}$ of the
operator $P$ is given by $P^* : (c_1, \dots, c_N) \mapsto \sum_{i=1}^{N} c_i K_{x_i}$. One can show that $P$ is a
bounded operator, with $\|P\|_{\mathcal{L}(\mathcal{H},\ell^2)} = \|P^*P\|^{1/2}_{\mathcal{L}(\mathcal{H},\mathcal{H})} = (\lambda_{\max}(G))^{1/2}$, where $\lambda_{\max}(G)$
is the largest eigenvalue of the Gram Matrix $G$. The
operator $P^*P$ is automatically positive and self-adjoint. We will later use the spectral
theorem on the bounded self-adjoint operator $P^*P$.
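The adjoint relation $\langle Pf, c \rangle_{\mathbb{R}^N} = \langle f, P^*c \rangle_{\mathcal{H}}$ can be checked numerically for functions that are finite kernel expansions. The sketch below assumes a Gaussian kernel and a hypothetical function $f = \sum_j a_j K_{z_j}$ with centers `z` that differ from the positions `x`; all names and values are illustrative. By the reproducing property, $\langle K_{z_j}, K_{x_i} \rangle_{\mathcal{H}} = K(z_j, x_i)$, and $\|P^*c\|_{\mathcal{H}}^2 = c^T G c$.

```python
import numpy as np

def gaussian_gram(xa, xb, sigma=0.3):
    """Matrix of K(x, x') = exp(-|x - x'|^2 / (2 sigma^2)) for 1-D inputs (assumed kernel)."""
    diff = xa[:, None] - xb[None, :]
    return np.exp(-diff ** 2 / (2.0 * sigma ** 2))

x = np.linspace(-1.0, 1.0, 6)                     # fixed positions x_1..x_N
z = np.array([-0.7, 0.1, 0.45])                   # hypothetical centers defining f
a = np.array([0.8, -1.2, 0.5])                    # f = sum_j a_j K(z_j, .)
c = np.array([0.3, -0.1, 0.7, 0.0, -0.4, 0.2])    # arbitrary vector in R^N

Pf = gaussian_gram(x, z) @ a                      # P f = (f(x_1), .., f(x_N))
lhs = Pf @ c                                      # <P f, c> in R^N
rhs = a @ gaussian_gram(z, x) @ c                 # <f, P* c>_H via the reproducing property

G = gaussian_gram(x, x)
norm_sq_Pstar_c = c @ G @ c                       # ||P* c||_H^2
lam_max = np.linalg.eigvalsh(G)[-1]               # largest eigenvalue of the Gram Matrix
```

The inequality `norm_sq_Pstar_c <= lam_max * (c @ c)` reflects the fact that $PP^*$ acts on $\mathbb{R}^N$ as the Gram Matrix.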

We start the proof of Theorem 2 with the following Lemma. 

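Lemma 2.1: The following characterizations of $\bar{f}$ are equivalent:

1. $\bar{f}$ satisfies:

(i) $\bar{f}(x_i) = f(x_i)$ for $i = 1, \dots, N$, and

(ii) $\|\bar{f}\|_{\mathcal{H}} \le \|g\|_{\mathcal{H}}$ $\forall g \in \mathcal{H}$ that satisfy $g(x_i) = f(x_i)$.

2. $\bar{f}$ satisfies:

$$\bar{f} = \sum_{i=1}^{N} f(x_i)\, W_{x_i}, \quad\text{where}\quad W_{x_i} = \sum_{\ell=1}^{N} (G^{-1})_{i\ell}\, K_{x_\ell} \quad (4)$$

3. $\bar{f}$ satisfies:

(i) $\bar{f}(x_i) = f(x_i)$ for $i = 1, \dots, N$, and

(iii) $\forall h \in \operatorname{Ker} P$ we have $\langle \bar{f}, h \rangle_{\mathcal{H}} = 0$.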

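Proof. We will show 1. $\to$ 2. $\to$ 3. $\to$ 1.

1. $\leftrightarrow$ 2. First, we show that the function described in 1. is unique. From the reproducing
property, we know that $\bar{f}$ has nonzero components along each of the $K_{x_i}$'s for which
$f(x_i) \neq 0$. Since $\mathcal{H}$ is a Hilbert space, we can always decompose $\bar{f}$ into a component
$\bar{f}_\parallel$ within the span of the $K_{x_i}$'s and a component $\bar{f}_\perp$ orthogonal to each $K_{x_i}$ (where
$i \in 1, \dots, N$). Now, $\|\bar{f}\|_{\mathcal{H}}^2 = \|\bar{f}_\parallel\|_{\mathcal{H}}^2 + \|\bar{f}_\perp\|_{\mathcal{H}}^2 \ge \|\bar{f}_\parallel\|_{\mathcal{H}}^2$. Thus, if $\|\bar{f}_\perp\|_{\mathcal{H}} \neq 0$, then $\bar{f}$ no
longer has minimal norm and contradicts property (ii). The component of $\bar{f}$ along each
of the $K_{x_i}$'s is determined by the value of $f(x_i)$. So, functions $\bar{f}$ that satisfy both (i)
and (ii) can be written $\bar{f} = \bar{f}_\parallel = \sum_{i=1}^{N} \alpha_i K_{x_i}$ for the fixed values of $\alpha_i$, $i = 1, \dots, N$.
In particular, the $\alpha_i$'s must satisfy:

$$f(x_j) = \sum_{i=1}^{N} \alpha_i K_{x_i}(x_j) = \sum_{i=1}^{N} \alpha_i G_{ij}, \quad j = 1, \dots, N.$$

Thus, the function described in 1. is unique. It is straightforward to see that the function
described in 2. is exactly the function described in 1. Evaluating the right side of (4) at
$x_j$, we obtain

$$\sum_{i=1}^{N} f(x_i)\, W_{x_i}(x_j) = \sum_{i=1}^{N} f(x_i) \sum_{\ell=1}^{N} (G^{-1})_{i\ell}\, G_{\ell j} = f(x_j).$$

Moreover, the function described in 2. lies entirely within the span of the $K_{x_i}$'s.
Therefore it obeys (i) and (ii) and we have 1. $\leftrightarrow$ 2.

2. $\to$ 3. Because 2. $\to$ 1., (i) is satisfied. We just need to show (iii).
For any $h \in \operatorname{Ker} P$,

$\langle h, K_{x_\ell} \rangle_{\mathcal{H}} = 0$ $\forall \ell \in 1, \dots, N$ by the reproducing property,

$\langle h, W_{x_i} \rangle_{\mathcal{H}} = 0$ $\forall i \in 1, \dots, N$ because the $W_{x_i}$'s are each a linear combination of the $K_{x_\ell}$'s,

$\langle h, \bar{f} \rangle_{\mathcal{H}} = 0$ because $\bar{f}$ is a linear combination of the $W_{x_i}$'s; thus (iii) holds.

3. $\to$ 1. Here, (i) is automatic, so we need to check (ii).

Take arbitrary $g \in \mathcal{H}$ with $g(x_i) = \bar{f}(x_i) = f(x_i)$ for $i = 1, \dots, N$.
Then, $g - \bar{f} \in \operatorname{Ker} P$.

From assumption (iii), $\langle \bar{f}, g - \bar{f} \rangle_{\mathcal{H}} = 0$,
and thus $\langle \bar{f}, g \rangle_{\mathcal{H}} = \|\bar{f}\|_{\mathcal{H}}^2$.

Now,

$$\|g\|_{\mathcal{H}}^2 = \|g - \bar{f} + \bar{f}\|_{\mathcal{H}}^2$$

$$= \|g - \bar{f}\|_{\mathcal{H}}^2 + \|\bar{f}\|_{\mathcal{H}}^2 + 2\langle g - \bar{f}, \bar{f} \rangle_{\mathcal{H}}$$

$$= \|g - \bar{f}\|_{\mathcal{H}}^2 + \|\bar{f}\|_{\mathcal{H}}^2$$

$$\ge \|\bar{f}\|_{\mathcal{H}}^2, \quad\text{with equality only if } g = \bar{f}.$$

Thus we have 1. $\to$ 2. $\to$ 3. $\to$ 1., so Lemma 2.1 is proved. □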
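The three characterizations in Lemma 2.1 can be checked on a small example. The sketch below assumes a Gaussian kernel, six positions, and hypothetical target values standing in for $f(x_i)$ (any values are attainable because the Gram Matrix is invertible here). It builds $\bar{f}$ from characterization 2, builds one element $h$ of $\operatorname{Ker} P$ from an extra kernel function $K_z$, and compares norms using coordinates in the basis $(K_{x_1}, \dots, K_{x_N}, K_z)$.

```python
import numpy as np

def gaussian_gram(xa, xb, sigma=0.3):
    """Matrix of K(x, x') = exp(-|x - x'|^2 / (2 sigma^2)) for 1-D inputs (assumed kernel)."""
    diff = xa[:, None] - xb[None, :]
    return np.exp(-diff ** 2 / (2.0 * sigma ** 2))

x = np.linspace(-1.0, 1.0, 6)                 # fixed positions
targets = np.sin(2.0 * x)                     # hypothetical values standing in for f(x_i)
G = gaussian_gram(x, x)
alpha_bar = np.linalg.solve(G, targets)       # f_bar = sum_i alpha_bar_i K_{x_i}

# h = K_z - sum_i w_i K_{x_i} with w = G^{-1} k_z vanishes at every x_j, so h is in Ker P.
z = np.array([0.0])
k_z = gaussian_gram(x, z)[:, 0]
w = np.linalg.solve(G, k_z)

# Coordinates in the extended basis (K_{x_1}, .., K_{x_N}, K_z).
pts = np.concatenate([x, z])
G_ext = gaussian_gram(pts, pts)
fbar_ext = np.concatenate([alpha_bar, [0.0]])
h_ext = np.concatenate([-w, [1.0]])

h_at_x = gaussian_gram(x, pts) @ h_ext        # values h(x_j), expected to be zero
inner_fbar_h = fbar_ext @ G_ext @ h_ext       # <f_bar, h>_H, expected to be zero (iii)

g_ext = fbar_ext + h_ext                      # another interpolant g = f_bar + h
norm_sq_fbar = fbar_ext @ G_ext @ fbar_ext
norm_sq_g = g_ext @ G_ext @ g_ext             # equals norm_sq_fbar + ||h||_H^2
```

Since `g` interpolates the same values as `f_bar` but has a strictly larger norm, the example matches the minimal-norm property (ii).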

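Back to the proof of Theorem 2. The functional $L_{Z_t,\lambda_t}$ in the Main Algorithm,
expressed in terms of $P$ and evaluated at a candidate function $g \in \mathcal{H}$, becomes:

$$L_{Z_t,\lambda_t}(g) = \frac{1}{N}\Big\|Pg - \Big(Pf + \frac{1}{t}\mathbf{b}\Big)\Big\|_{\ell^2}^2 + \lambda_t\|g\|_{\mathcal{H}}^2.$$

The minimizer $g = f_{Z_t,\lambda_t}$ must satisfy $\frac{\partial}{\partial\gamma} L_{Z_t,\lambda_t}(g + \gamma h)\big|_{\gamma=0} = 0$ for every $h \in \mathcal{H}$. In other words,
the first variational derivative of $L_{Z_t,\lambda_t}$ is 0 at its minimizer. Recalling that $Pf =
P\bar{f}$, this condition becomes:

$$0 = \frac{\partial}{\partial\gamma}\Big[\frac{1}{N}\Big\langle Pg + \gamma Ph - P\bar{f} - \frac{1}{t}\mathbf{b},\; Pg + \gamma Ph - P\bar{f} - \frac{1}{t}\mathbf{b}\Big\rangle_{\ell^2} + \lambda_t\|g + \gamma h\|_{\mathcal{H}}^2\Big]_{\gamma=0}$$

$$= \frac{2}{N}\Big\langle Ph,\; Pg - P\bar{f} - \frac{1}{t}\mathbf{b}\Big\rangle_{\ell^2} + 2\lambda_t\langle h, g \rangle_{\mathcal{H}}$$

$$= 2\Big\langle h,\; \frac{1}{N}P^*Pg - \frac{1}{N}P^*P\bar{f} - \frac{1}{N}P^*\Big(\frac{1}{t}\mathbf{b}\Big) + \lambda_t g\Big\rangle_{\mathcal{H}}.$$

This must be true for any function $h$, so $\frac{1}{N}P^*Pg - \frac{1}{N}P^*P\bar{f} - \frac{1}{N}P^*\big(\frac{1}{t}\mathbf{b}\big) + \lambda_t g = 0$,
implying

$$f_{Z_t,\lambda_t} = \bar{f} - \Big(\frac{1}{N}P^*P + \lambda_t\Big)^{-1}\lambda_t\bar{f} + \Big(\frac{1}{N}P^*P + \lambda_t\Big)^{-1}\frac{1}{N}P^*\Big(\frac{1}{t}\mathbf{b}\Big).$$

It follows that

$$\|f_{Z_t,\lambda_t} - \bar{f}\|_{\mathcal{H}} \le \Big\|\Big(\frac{1}{N}P^*P + \lambda_t\Big)^{-1}\lambda_t\bar{f}\Big\|_{\mathcal{H}} + \frac{1}{Nt}\Big\|\Big(\frac{1}{N}P^*P + \lambda_t\Big)^{-1}P^*\Big\|_{\mathcal{L}(\ell^2,\mathcal{H})}\|\mathbf{b}\|_{\ell^2}. \quad (5)$$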

In order to show stability, we bound the two terms on the right of (5), and construct
these bounds so they vanish as $t \to \infty$. That is, we need to bound the norms above. To
accomplish this, we will use the spectral theorem on the bounded self-adjoint operator
$P^*P$.

To bound the first term in equation (5), recall that the operator obtained from the
function $\phi_t(z) = (\frac{1}{N}z + \lambda_t)^{-2}\lambda_t^2$ of the self-adjoint operator $P^*P$ is self-adjoint.
Also, since $P^*P$ is a positive operator, the spectrum $\operatorname{spec}(P^*P)$ of the operator $P^*P$
is concentrated on $\mathbb{R}_+ \cup \{0\}$. Using the spectral measure $\nu_{\bar{f};P^*P}(z)$ on the spectrum
$\operatorname{spec}(P^*P)$, we find:

$$\Big\|\Big(\frac{1}{N}P^*P + \lambda_t\Big)^{-1}\lambda_t\bar{f}\Big\|_{\mathcal{H}}^2 = \Big\langle\Big[\Big(\frac{1}{N}P^*P + \lambda_t\Big)^{-1}\lambda_t\Big]^2\bar{f},\; \bar{f}\Big\rangle_{\mathcal{H}} = \int_{\operatorname{spec}(P^*P)} \left(\frac{\lambda_t}{\frac{1}{N}z + \lambda_t}\right)^2 d\nu_{\bar{f};P^*P}(z).$$

Because $\lambda_t \to 0$, we have $\phi_t(z) = (\frac{1}{N}z + \lambda_t)^{-2}\lambda_t^2 \to 0$ for all $z \in \mathbb{R}_+ \setminus \{0\}$. By
Lemma 2.1, we know that $\bar{f} \perp \operatorname{Ker} P$, and since $\operatorname{Ker} P = \operatorname{Ker} P^*P$, we know that
$\nu_{\bar{f};P^*P}(\{0\}) = 0$. Since $\phi_t(z) \le 1$ for all $z \in \operatorname{spec}(P^*P) \subset \mathbb{R}_+ \cup \{0\}$, it then follows from
the dominated convergence theorem that $\|(\frac{1}{N}P^*P + \lambda_t)^{-1}\lambda_t\bar{f}\|_{\mathcal{H}}^2 \to 0$. One cannot
give a more explicit bound for this first term; it would require more specific knowledge
of the relationship between $\mu$ and $\mathcal{H}$. In any case, we have achieved our goal in showing
that the first term of (5) vanishes as $t \to \infty$.

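We need the second term in equation (5) to vanish also. Recall that for an operator
$Q : \mathcal{H}_1 \to \mathcal{H}_2$, it is true that $\|Q^*\|_{\mathcal{L}(\mathcal{H}_2,\mathcal{H}_1)} = \|Q^*Q\|^{1/2}_{\mathcal{L}(\mathcal{H}_1,\mathcal{H}_1)}$, and that a continuous
real function of a self-adjoint operator such as $P^*P$ is self-adjoint.

We use the spectral theorem for the bounded self-adjoint operator $A = P^*P$, namely
the fact $\|\phi(A)\|_{\mathcal{L}(\mathcal{H},\mathcal{H})} \le \sup\{|\phi(z)| ;\; z \in \operatorname{spec}(A)\}$, where $\phi(z) = \frac{z}{(\frac{1}{N}z + \lambda_t)^2}$ here.
The maximum value of $\phi(z)$ occurs at $z = N\lambda_t$, and it is $\frac{N}{4\lambda_t}$. Thus,

$$\Big\|\Big(\frac{1}{N}P^*P + \lambda_t\Big)^{-1}P^*\Big\|_{\mathcal{L}(\ell^2,\mathcal{H})} \le \frac{\sqrt{N}}{2\sqrt{\lambda_t}},$$

$$\frac{1}{Nt}\Big\|\Big(\frac{1}{N}P^*P + \lambda_t\Big)^{-1}P^*\Big\|_{\mathcal{L}(\ell^2,\mathcal{H})}\|\mathbf{b}\|_{\ell^2} \le \frac{1}{2t\sqrt{N\lambda_t}}\|\mathbf{b}\|_{\ell^2} \le \frac{1}{2t\sqrt{\lambda_t}}\, b_{\max} \quad\text{a.s.}$$

As long as we design $\lambda_t$ so that $t\sqrt{\lambda_t} \to \infty$, then we have the desired convergence of
this term to 0. We are done with the second term of equation (5). Theorem 2 is proven.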

□ 
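The maximization step used for the second term can be verified numerically. The sketch below, with hypothetical values of $N$ and $\lambda_t$, evaluates $\phi(z) = z/(\frac{1}{N}z + \lambda_t)^2$ on a grid of non-negative $z$ and compares the grid maximum with the claimed maximizer $z = N\lambda_t$ and maximum value $N/(4\lambda_t)$.

```python
import numpy as np

N, lam = 50, 1e-2          # hypothetical number of positions and regularization parameter

def phi(z):
    """phi(z) = z / (z / N + lam)^2, the function applied to P*P in the proof."""
    return z / (z / N + lam) ** 2

# Grid on [0, 20 * N * lam]; phi is decreasing beyond its maximizer, so this range suffices.
z = np.linspace(0.0, 20.0 * N * lam, 200001)
values = phi(z)
grid_max = values.max()
z_star = z[values.argmax()]

claimed_max = N / (4.0 * lam)        # value stated in the proof
claimed_argmax = N * lam             # maximizer stated in the proof

# Resulting operator-norm bound sqrt(N) / (2 sqrt(lam)) for the second term.
norm_bound = np.sqrt(claimed_max)
```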



5 Conclusion 



We have proved stability for the regularized least squares regression algorithm, in the
sense in which inverse problems are examined. We have shown stability for this
algorithm in two cases: the case when the number of data points $N$ is a constant, and the
case where N — > oo. It is important that our algorithm is stable in this sense, because 
we do not want any inherent error in the algorithm's output. Neither a small amount of 
noise in the data nor a small amount of regularization should drastically influence the 
algorithm's output. 

We hope that the reader will gain more from our result than the knowledge that 
regularized least squares regression is stable in the inverse operator sense. We have 
found the particular methods introduced in the proofs of Theorem 1 and Theorem 2 
useful for various learning problems, especially those which require the convexity of 
learning functionals or convergence of learning algorithms. Namely, we demonstrate 
two methods for showing that the minimizers of two learning functionals are close: use 
of the spectral theorem, and the technique of Lemma 1.1, which can both be generally 
applied to other learning algorithms. 



6 Acknowledgements 

The author would like to express infinite gratitude to Ingrid Daubechies. 



References 



1. O. Bousquet and A. Elisseeff. Algorithmic stability and generalization performance. In
Advances in Neural Information Processing Systems 13: Proc. NIPS'2000, 2001.

2. F. Cucker and S. Smale. On the mathematical foundations of learning. Bulletin (New Series)
of the American Mathematical Society, 39(1):1-49, 2002.

3. L. P. Devroye and T. J. Wagner. Distribution-free performance bounds for potential function
rules. IEEE Trans. Inform. Theory, 25(5):601-604, 1979.

4. B. Heisele, A. Verri, and T. Poggio. Learning and Vision Machines. IEEE Visual Perception:
Technology and Tools, 90(7):1164-1177, 2002.

5. G. S. Kimeldorf and G. Wahba. Some Results on Tchebycheffian spline functions. Journal of
Mathematical Analysis and Applications, 33:82-85, 1971.

6. S. Mukherjee, E. Osuna, and F. Girosi. Nonlinear Prediction of Chaotic Time Series Using
Support Vector Machines. IEEE Workshop on Neural Networks for Signal Processing VII,
1997.

7. M. Reed and B. Simon. Methods of Modern Mathematical Physics I: Functional Analysis.
Academic Press, San Diego, CA, 1980.

8. B. Schoelkopf and A. Smola. Learning with Kernels - Support Vector Machines,
Regularization, Optimization, and Beyond. The MIT Press, Cambridge, MA, 2002.

9. A. N. Tikhonov and V. Y. Arsenin. Solution of Ill-Posed Problems. Winston, Washington, DC,
1977.

10. V. Vapnik. Statistical Learning Theory. John Wiley and Sons Inc., New York, 1998. A
Wiley-Interscience Publication.

11. V. Vapnik. The Nature of Statistical Learning Theory. Springer Verlag, New York, 1995.



